library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pacman)
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
plot(cars)
Add a new chunk by clicking
2. Press
Ctrl+Alt+I 3. Just type out the formatting, e.g. three back
ticks, curly brackets containing
r, and three more back
ticks to finish the block
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
one <- 1
one
## [1] 1
RStudio shows you four panes:
You can arrange these in any order using Tools > Global Options.
RStudio has fantastic autocomplete capabilites. To autocomplete just press TAB. This is especially useful when loading files as using autocomplete will help you to identify relevant ones.
R needs to know which folder on your computer you are working in. There are various ways to do this.
Within
RStudio go to Session >> Set Working Directory > Choose Directory...
You also have an option to set the working directory to the most recently-loaded .R or .Rmd file.
To do this type
setwd("path/to/directory")
Unfortunately, if you are on a windows machine you will need to
change all backslashes \ to forward slashes /.
This is because R follows UNIX conventions which are native to Linux and
Mac computers.
If you are not sure what your working directory is type
getwd()
Getting the right path is a vital first step in R, and you really need to know how to do this. Here are some instructional videos if you get stuck…
For Windows computers refer to this YouTube video. For Macs refer to this YouTube video. If you’re on Linux then refer to this YouTube video
However this is a big problem with this approach. If you send your file to someone else who is working on a different machine, it is most likely that your path will be pointing to the wrong location
By far the best way to set the working directory is create and “R Project” in RStudio. When you do this, the RProject stores keeps track of the directory where the files are stored, so there is no need to set the working directory.
In addition, if you set up a project you can use it to store all the datasets and other data objects which were opened in the last session. This is quite convenient but it can take up a lot of memory, and you also run the risk that you will lose track of how the variables were created, which is a threat to reproducibility.
We can use the console for general arithmetic
1 + 3 + 5
## [1] 9
(2 + 9) / 7
## [1] 1.571429
1 + 2+ 3
## [1] 6
We can also create variables, e.g.
x <- 10 # This is a comment
x = 10 # Does the same thing!!
y <- 21
x*y
## [1] 210
Most work in R is done using Functions. These take the following form: function(argument(s)). Here are some functions
sqrt(10)
## [1] 3.162278
seq(1, 10, 2)
## [1] 1 3 5 7 9
rep(5, 10)
## [1] 5 5 5 5 5 5 5 5 5 5
EX1: What do the arguments of seq and rep
do? To find out more search for the relevant help file in the console by
typing ?seq or by using Google.
EX2: Have a look at the following arguments called gsub
and grepl. What do they do? Clue: if you’re stuck, search
the help file using ?
gsub("R-studio", "Rstudio", "R-studio is a great piece of software")
## [1] "Rstudio is a great piece of software"
grepl("chocolate", "Mary likes chocolate cookies")
## [1] TRUE
It’s possible to create your own functions. This makes R extremely powerful and extendible. We’re not going to cover making your own functions in this course, but it’s important to be aware of this capability. There are plenty of good resources online for learning how to do this, including this one
As we have seen above, to find out about a particular function just
type ? and the name of the function into the console,
e.g. ?grepl. This accesses the help files on your computer.
If you’d like to search more broadly type ??grepl and your
computer will look online for relevant materials on CRAN (the main R
website)
Help files in R are quite densely written and not particularly aimed at beginners. Fortunately there are loads of excellent resources on the internet. Here are some really good sites:
And there are plenty more! If you find a good one share it with your colleagues via email, Twitter, or whatever social media you prefer!
To enhance the basic capabilities of R, we need to load
packages/libraries. Most of the time, we download these from ‘CRAN’
Tools > Install packages or
install.packages(). Once the package/library is installed
(i.e. it is sitting somewhere on your computer), we then need to
load it to the current R session using the
library() function.
Remember using a package/library is a two-stage process. We
Install the package/library onto your computer (from
the internet)Load the package/library into your current session
using the library command.One of the most useful packages is called ‘tidyverse’.
If your first code block is called “settings” then the code for that block will be automatically run when you open the file. So it is useful to load all of your packages in this first block.
One package I always load at the start is the
tidyverse
It contains a number of useful commands for plots, and data manipulation.
Note that
Install the ‘tidyverse’ package, and then load it with the following function:
library(tidyverse)
I find that a particularly easy way to load packages is via the
p_load function from the pacman library. This
will check if the package has been loaded into the current session. If
not it will search to find out if the package has been stored in your
computer. If the package is not in your computer it will automatically
find it. It basically does everything you need! However, we are not
going to practise using it, but just to let you know that it exists!
To find out more about a package type ?package_name in
the console. Alternatively you can look for the package documentation on
CRAN.
Most of the functions loaded in a package should work ‘out of the
box’. However occasionally you need to refer to the package first, and
then the function using the format
package_name::function_from_that_package. This is useful
for a variety of reasons:
ggplot2geom_point from this package. What
does it do?A variable is a type of ‘object’ which R stores in memory. R is
capable of creating and storing a wide range of objects. To see what
type of object we have created, we use the function
class(), e.g.
x <- 1
class(x)
## [1] "numeric"
z <- "hello"
class(z)
## [1] "character"
class is one of the most useful functions in R as errors are often due to misassignment of class, e.g.
x + z
## Error in x + z: non-numeric argument to binary operator
Here we have tried to add a number to a string which is clearly
impossible. It’s possible to change the class of an object using
commands such as as.character, as.integer,
as.numeric, as.factor, e.g.
one <- "1"
x + one
## Error in x + one: non-numeric argument to binary operator
one <- as.numeric(one)
x + one
## [1] 2
Here is a list of the main object classes in R:
In order to create a vector we need to use the c
function. (c = ‘combine’), e.g.
list.of.numbers <- c(1,4,54,22,43,9,0,0,21)
mean(list.of.numbers)
## [1] 17.11111
sd(list.of.numbers)
## [1] 19.85223
a.character.vector <- c("Mary", "Jane", "Ali", "Chen")
a.list <- as.list(c(1, 2, "Mary", "Jane"))
A data frame is a two-dimensional object containing variables and row numbers. It’s basically a spreadsheet.
The following code creates a data frame programmatically. It creates
two variables, and combines them together to make a data frame. Note
that to do this we need to use the functions as.data.frame
and cbind.
list.of.movies <- c("Independence Day", "Pretty Woman", "The Godfather Part
Two", "Planet of the Apes (original)")
rotten.tomatoes.variable <- c(62, 61, 97, 89)
df <- as.data.frame(cbind(list.of.movies, rotten.tomatoes.variable)) # 'cbind' binds columns together
To glimpse the top few rows type
head(name_of_data_frame) in the console, e.g.
head(df)
To view the data frame in the ‘source’ window, type
View(name_of_data_frame) in the console, .e.g.
View(df) #NB first letter is a capital letter.
To refer to variables, use the following syntax
data_frame_name$variable_name, e.g.
df$list.of.movies
## [1] "Independence Day" "Pretty Woman"
## [3] "The Godfather Part\nTwo" "Planet of the Apes (original)"
When naming variables we can use dots and underscores,
e.g. df$list.of.movies and df$list_of_movies.
We can use numbers as long as they don’t come at the beginning,
e.g. df$list_of_movies.v3.
If you use this convention, then the names for variables can get very long. However, it’s generally useful, as in R you often have multiple data frames loaded into memory. By specifiying both the name of the data frame and the variable, this avoids confusion.
Try to be consistent with your naming conventions. I tend to use
underscores to name variables,
e.g. data.frame.x$variable_y. This is also what Hadley
Wickham recommends (Have a look at the Tidyverse Style Guide)
If you’d like to see all the variable names in a data frame type
names(data_frame), e.g.
names(df)
## [1] "list.of.movies" "rotten.tomatoes.variable"
Whenever you wish to access the contents of an object with multiple
values (e.g. a data frame) you use indexes. These are placed inside
square brackets, e.g. [1]. Have a look at the following
example:
vector = c(1, 4, 2, 99, 0.5, 10)
vector[1]
## [1] 1
vector[2]
## [1] 4
vector[5]
## [1] 0.5
vector[2:5]
## [1] 4.0 2.0 99.0 0.5
mean(vector[2:5])
## [1] 26.375
Here is how we would use vectors with a datafrome
df[1,2]
## [1] "62"
df[1,] # here the second number is blank
df[,2] # here the first number is blank
## [1] "62" "61" "97" "89"
What does each number refer to? What happens when we leave a blank cell?
However, rather than use the menu, it’s much better to use actual code, as this will automate the process. Let’s import a dataset on World Happiness Report (2017), by country. The files are WHR_2017.xlsx, WHR_2017.sav, and WHR_2017.csv. Alternatively you can actually download the data set straight from the URL (below)
# This code depends on using the `tidyverse` and `haven` packages
# If you have not imported / loaded them, you need to so
# library(tidyverse)
# library(haven)
df <- readxl::read_excel("WHR_2017.xlsx") # Read an excel file
df <- haven::read_spss("WHR_2017.sav") # Read from an SPSS file
df <- read_csv("WHR_2017.csv") # Read from a .csv file
#Or to download straight from the URL!!
df <- read_csv("https://verbingnouns.github.io/AdventuresInR/docs/WHR_2017.csv")
Possibly the best data format to work in is the .csv
data format (Comma-Separated Value). This is good because it is readable
in Excel, small, simple, and not easily-corrupted. It’s only
disadvantage is that it does not support Excel formulas or formatting
(NB some would say this is an advantage as Excel formulae are not that
reliability and not very consistent with scientific reproducibility)
To read .csv files we use the read.csv() function from
base R, or read_csv() from the tidyverse (I would go with
the latter as it also shows you a list of the variable types)
We’re going to subset the WHR dataset (i.e. choose only
those cases/observations which fulfil a specific criterion). To do this
we’re going to use the which() function. When you apply
which to a variable in a dataset, it will produce indices
(indexes) of the rows which fulfil a certain criterio,
e.g. which(df$var_name == 2) will give you the indices of
all rows where the value of the variable is 2.
Armed with this knowledge, your task is to subset the data frame so that it only contains information from African countries.
If you’re stuck have a look at the answer below.
df.Africa <- df[which(df$region == "Africa"), ]
Okay, the above code is pretty horrible, so we’re going to explore an
alternative using the package dplyr which is from the
tidyverse. But before we can use dplyr we have
to learn how to ‘pipe’.
Pipes are written in R as %>% (note you must use a
percentage sign before and after the pipe). To demonstrate what pipes
do, I have a look at the following pseudocode.
All pipes do is enable us to ‘pass’ a data frame (or another object) to a new function without having to keep on specifying the data frame. In addition, we can chain pipes together indefinitely.
Here’s how we would subset the data frame using piping:
df.Africa <- filter(df, region == "Africa") # This is the version without piping
df %>% filter(region == "Africa") -> df.Africa # This is the version with piping. It looks longer, but we can chain multiple functions together!
Note that to create a new data frame, we need a solid arrow at the
end. If we don’t include that solid arrow, the results are shown in the
console, but no new data frame is created. This is an incredibly useful
feature of pipes. You can try before you buy!
And here is an example where we chain a series of pipes together:
df %>%
group_by(region) %>%
summarise(mean.happiness = mean(happiness_score)) ->
df.mean.happiness.by.region
NB When piping the code becomes more readable when the line ends with the pipe.
There are a couple of important points to note.
->. If we don’t store the results they will merely be
displayed in the console.Piping is a key technique in R and once you’ve learnt it you will write much more powerful and readable code.
As well as using pipes to create data frame, you can also insert pipes into both analyses and figures! Here is an example of a pipe inserted within an ANOVA.
# An ANOVA without a pipe. NB we are using the base function "aov". If you would like to conduct SPSS-style ANOVAs, the best package is called "afex".
mod <- aov(happiness_rank ~ region, data = df)
pacman::p_load(broom) # To load the "tidy" function.
tidy(mod)
# Here we use a pipe inside the analysis
mod <- aov(happiness_rank ~ region, # NB note we can break the line after a comma
data = df %>% filter(region == "Africa" | region == "South America"))
tidy(mod)
Note how I have broken some of the lines after a comma. This makes the code more readable. Generally we can break a line when it ends in some kind of symbol, e.g. a pipe, an arrow, or a comma.
Loops and if-then statements are useful programming tools which have
the same structure: FUNCTION (STATEMENT) {.....}.
for(i in 1:10){
print(as.character(i))
}
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "5"
## [1] "6"
## [1] "7"
## [1] "8"
## [1] "9"
## [1] "10"
The code below creates a sequence ranging from 0 to 30 going up in steps of 0.25. Try to achieve the same result using a loop
seq(0,30,2.5)
## [1] 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 25.0 27.5 30.0
To demonstrate if-then statements, we are going to create a new variable which shows if the happiness index is above the mean
df$happiness_above_mean <- 0 # Set variable to 0
mean_happiness <- mean(df$happiness_score) # Calculate mean mpg
for (i in 1:nrow(df)){
if(df$happiness_score[i] > mean_happiness){df$happiness_above_mean[i] <- 1}
}
And here is the same process using dplyr, which avoids
the loop and the if-then statement.
df %>%
mutate(happiness_above_mean = as.numeric(happiness_score > mean(happiness_score))) ->
df
Note loops and if-then statements are quite verbose, and there is almost always a neater and much shorter alternatives. However, I think they are useful procedures for the relative beginner.
Here is the way to create the variable without using the tidyverse
df <- read_csv("WHR_2017.csv")
df$happiness_above_mean <- as.numeric(df$happiness_score > mean(df$happiness_score))
So how does this work? The statement in brackets evaluates to TRUE /
FALSE. We then turn this into a number using as.numeric.
TRUE evaluates to 1, while FALSE evaluates to 0.
It can be quite useful to chain statements. For example, if we wish to identify countries where both the happiness score and life expectancy are above the mean, we could do this….
df$happiness_and_LE_above_mean <- as.numeric((df$happiness_score > mean(df$happiness_score)) & (df$life_expectancy > mean(df$life_expectancy)))
## EX6: Creating variables
Try to identify countries where both the GDP per capita and trust in the government are above the mean.
Whenever you run an analysis in R and save that to an object, the object has an internal structure. To demonstrate this, let’s do a simple regression using the mtcars dataset:
df <- read.csv("WHR_2017.csv")
head(df)
Let’s draw a plot looking at the relationship between GDP per capita and Happiness Score. We’re not going to focus on the code, which will be covered in the next session.
g <- ggplot(aes(x = df$gdp_per_capita, y = df$happiness_score), data = df)
g <- g + geom_point()
g <- g + geom_smooth()
g
## Warning: Use of `df$gdp_per_capita` is discouraged.
## ℹ Use `gdp_per_capita` instead.
## Warning: Use of `df$happiness_score` is discouraged.
## ℹ Use `happiness_score` instead.
## Warning: Use of `df$gdp_per_capita` is discouraged.
## ℹ Use `gdp_per_capita` instead.
## Warning: Use of `df$happiness_score` is discouraged.
## ℹ Use `happiness_score` instead.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Now let’s run a regression. Again, we’re not going going to focus on the code, which will be covered in the final session.
mod <- lm(happiness_score ~ gdp_per_capita, data = df) # mod = "model"
pacman::p_load(broom) # Broom is a package which produces neat tables of results
tidy(mod) # This is a broom function which tidies up the statistical results for reporting
Now, let’s have a look at the structure of this model.
There are two ways to do this:
str function, e.g. str(mod)mod$, and then use autocomplete.We can see that the $ symbol has a dual function in R:
firstly, to specify variables within dataframes, and secondly to specify
subcomponents of an object.
It is useful to be able to refer to subcomponents of an object so that we can integrate into our report, e.g. the regression yielded a value of 0.6601055
Once you get stuck have a look at the first code chunk below. This contains the solution but with pesky errors added! See if you can sort out the errors.
whr <- read_csv("WHR_2017.csv")
le <- read_csv("WHO_life_expectancy.csv")
whr %>% # NB we need to ensure that the "country" variable has exactly the same name in both datasets
rename(country = Country) ->
whr
le %>%
filter(Year == 2015) %>%
merge(whr) %>%
df
plot(df$`Life expectancy`, df$GDP)
plot(df$`Life expectancy`, df$family)
The code in this chunk shows the solution!
whr <- read_csv("WHR_2017.csv")
le <- read_csv("WHO_life_expectancy.csv")
whr %>%
rename(Country = country) ->
whr
le %>%
filter(Year == 2015) %>%
merge(whr) ->
df
plot(df$`Life expectancy`, df$GDP)
plot(df$`Life expectancy`, df$family)
Comments
If you’d like to comment on any code you write (i.e. you do not wish R to try to ‘run’ this code) just add a hash (
#) or series of hashes in front of it, e.g.df <- read.csv("csv_file.csv") # This reads in the main file for the experimentIt is good practice to make a lot of comments, especially when you are just starting out in R. They will help to you remind you of what the code does, and you could also use them to highlight any difficulties you have had, things you need to be careful about, useful online resources etc.